6 Appendix
As described in Section 3, MemRecall is the process of extracting the key blocks. We also need "strides" beyond plain lexical matching: BM25 is a well-known TF-IDF-like information retrieval method that scores each block by the words it shares with the query or textual label, so semantic relevance is neglected. GloVe, a family of pretrained word representations, can supply this missing semantic signal. The entries below show blocks retrieved for the query "textual label".
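To make the contrast concrete, the following is a minimal sketch of the two scoring signals, not the actual MemRecall implementation: a standard BM25 (Okapi) scorer, in which a block earns score only from words it shares with the query, and a GloVe-style semantic score that averages word vectors and compares them by cosine similarity. The tiny 3-dimensional vectors are hypothetical stand-ins for real pretrained GloVe embeddings (typically 50-300 dimensions).

```python
import math
from collections import Counter

def bm25_scores(query, blocks, k1=1.5, b=0.75):
    """Standard BM25: purely lexical, so semantically related but
    lexically different blocks score zero."""
    n = len(blocks)
    avgdl = sum(len(blk) for blk in blocks) / n
    df = Counter(t for blk in blocks for t in set(blk))  # document frequency
    scores = []
    for blk in blocks:
        tf = Counter(blk)
        s = 0.0
        for t in query:
            if tf[t] == 0:
                continue  # only shared words contribute (TF-IDF-like)
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(blk) / avgdl))
        scores.append(s)
    return scores

# Hypothetical stand-ins for pretrained GloVe vectors.
TOY_VECS = {
    "textual": [1.0, 0.1, 0.0],
    "label":   [0.0, 1.0, 0.1],
    "labels":  [0.0, 0.9, 0.2],  # near-synonym of "label"
    "tag":     [0.1, 0.8, 0.3],
}

def semantic_score(query, block):
    """GloVe-style fix: cosine similarity of averaged word vectors."""
    def mean_vec(tokens):
        vecs = [TOY_VECS[t] for t in tokens if t in TOY_VECS]
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    q, d = mean_vec(query), mean_vec(block)
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(a * a for a in d))
    return dot / norm

query = "textual label".split()
blocks = [doc.split() for doc in
          ("semantic tag for each block", "textual labels of audio classes")]
print(bm25_scores(query, blocks))                       # first block scores 0: no shared words
print([semantic_score(query, blk) for blk in blocks])   # both nonzero: semantics recovered
```

In the toy run, BM25 gives the first block a score of zero because it shares no surface form with the query (and even "labels" in the second block misses the query term "label"), while the embedding-based score ranks both blocks as relevant, illustrating why a semantic signal is needed on top of lexical matching.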
Creating User-steerable Projections with Interactive Semantic Mapping
Oliveira, Artur André; Espadoto, Mateus; Hirata, Roberto Jr.; Cesar, Roberto M. Jr.; Telea, Alex C.
Dimensionality reduction (DR) techniques map high-dimensional data into lower-dimensional spaces. Yet, current DR techniques are not designed to explore semantic structure that is not directly available in the form of variables or class labels. We introduce a novel user-guided projection framework for image and text data that enables customizable, interpretable data visualizations via zero-shot classification with Multimodal Large Language Models (MLLMs). We enable users to steer projections dynamically via natural-language guiding prompts, specifying high-level semantic relationships of interest that are not explicitly present in the data dimensions. We evaluate our method across several datasets and show that it not only enhances cluster separation, but also transforms DR into an interactive, user-driven process. Our approach bridges the gap between fully automated DR techniques and human-centered data exploration, offering a flexible and adaptive way to tailor projections to specific analytical needs.
- Europe > Netherlands (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Portugal > Coimbra > Coimbra (0.04)
- Research Report (0.64)
- Overview (0.46)
Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance
Zhang, Yaoyun; Xu, Xuenan; Wu, Mengyue
The video-to-audio (V2A) generation task has drawn attention in the field of multimedia due to its practicality in producing Foley sound. Semantic and temporal conditions are fed to the generation model to indicate sound events and temporal occurrence. Recent studies on synthesizing immersive and synchronized audio face challenges on videos with a moving visual presence: the temporal condition is not accurate enough, while the low-resolution semantic condition exacerbates the problem. To tackle these challenges, we propose Smooth-Foley, a V2A generative model that takes semantic guidance from the textual label throughout generation to enhance both semantic and temporal alignment in the audio. Two adapters are trained to leverage pre-trained text-to-audio generation models: a frame adapter integrates high-resolution frame-wise video features, while a temporal adapter integrates temporal conditions obtained from similarities between visual frames and textual labels. The incorporation of semantic guidance from textual labels achieves precise audio-video alignment. We conduct extensive quantitative and qualitative experiments. Results show that Smooth-Foley outperforms existing models in both continuous sound scenarios and general scenarios. With semantic guidance, the audio generated by Smooth-Foley exhibits higher quality and better adherence to physical laws.
The Solution for Language-Enhanced Image New Category Discovery
Xu, Haonan; Chao, Dian; Wu, Xiangyu; Wan, Zhonghua; Yang, Yang
Treating texts as images, combining prompts with textual labels for prompt tuning, and leveraging the alignment properties of CLIP have been successfully applied in zero-shot multi-label image recognition. Nonetheless, relying solely on textual labels to store visual information is insufficient for representing the diversity of visual objects. In this paper, we propose reversing the training process of CLIP and introducing the concept of Pseudo Visual Prompts. These prompts are initialized for each object category and pre-trained on large-scale, low-cost sentence data generated by large language models. This process mines the aligned visual information in CLIP and stores it in class-specific visual prompts. We then employ contrastive learning to transfer the stored visual information to the textual labels, enhancing their visual representation capacity. Additionally, we introduce a dual-adapter module that simultaneously leverages knowledge from the original CLIP and newly learned knowledge derived from downstream datasets. Benefiting from the pseudo visual prompts, our method surpasses the state-of-the-art not only on clean annotated text data but also on pseudo text data generated by large language models.
- South America > Colombia > Meta Department > Villavicencio (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
Better Few-Shot Relation Extraction with Label Prompt Dropout
Few-shot relation extraction aims to learn to identify the relation between two entities based on very limited training examples. Recent efforts found that textual labels (i.e., relation names and relation descriptions) can be extremely useful for learning class representations, which benefits the few-shot learning task. However, how best to leverage such label information in the learning process remains an important research question. Existing works largely assume such textual labels are always present during both learning and prediction. In this work, we argue that such approaches may not always lead to optimal results. Instead, we present a novel approach called label prompt dropout, which randomly removes label descriptions during learning. Our experiments show that our approach leads to improved class representations, yielding significantly better results on the few-shot relation extraction task.
- Asia > Singapore (0.04)
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- (2 more...)
- Research Report > Promising Solution (0.34)
- Research Report > New Finding (0.34)
- Research Report > Experimental Study (0.34)
Zero-Shot Audio Classification Based on Class Label Embeddings
This paper proposes a zero-shot learning approach for audio classification based on textual information about class labels, without any audio samples from the target classes. We propose an audio classification system built on a bilinear model, which takes audio feature embeddings and semantic class label embeddings as input and measures the compatibility between an audio feature embedding and a class label embedding. We use VGGish to extract audio feature embeddings from audio recordings. We treat textual labels as semantic side information about audio classes and use Word2Vec to generate class label embeddings. Results on the ESC-50 dataset show that the proposed system can perform zero-shot audio classification with a small training dataset. It achieves accuracy (26% on average) better than random guessing (10%) in each audio category; in particular, it reaches up to 39.7% for the category of natural audio classes.
A Preliminary Analysis and Catalog of Thematic Labels
Wagner, Earl J. (University of Maryland, College Park)
An account of the labels commonly used to express themes could both help in assessing the coverage of models of narrative processing and support recognizing themes by the textual appearance of these labels. This paper presents a preliminary analysis and catalog of thematic labels such as “vicious cycle” and “underdog”. In contrast to a top-down approach that characterizes themes in terms of components of a model of narrative processing, a bottom-up approach is taken: thematic labels are gathered independently of any particular model and catalogued according to the types of relationships the corresponding themes convey.
- North America > United States > Maryland > Prince George's County > College Park (0.15)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)